Abstract

Motivation: Probe-level models have led to improved performance in microarray
studies, but the various sources of probe-level contamination are still poorly understood.
Data-driven analysis of probe performance can be used to quantify the uncertainty in
individual probes and to highlight the relative contribution of different noise sources.
Improved understanding of the probe-level effects can lead to improved preprocessing
techniques and microarray design.

Results: We have implemented probabilistic tools for probe performance analysis and
summarization on short oligonucleotide arrays. In contrast to standard preprocessing
approaches, the methods provide quantitative estimates of probe-specific noise and
affinity terms, and tools to investigate these parameters. Tools to incorporate prior
information on the probes in the analysis are provided as well. Comparisons to known
probe-level error sources and spike-in data sets validate the approach.

Availability: Implementation is freely available in R/BioConductor:
http://www.bioconductor.org/packages/release/bioc/html/RPA.html.
1 Introduction



Probe defects are a major source of noise and uncertainty in microarray studies. Probe
performance is affected by RNA degradation, non-specific hybridization, annotation errors
and other, potentially unknown factors. While the use of multiple probes and modeling
of the probe effects through probe-specific parameters have been shown to yield improved
estimates of the target signal [6, 8], the various sources of probe-level noise and their
relative contributions remain poorly understood.

Figure 1: Demonstration of probeset-level signal estimation for probeset 200046_at in the
ALLMLL example data set in BioConductor. Top: the 11 probe-level signals across 20
measurement samples are indicated by gray lines; the black line illustrates the
probeset-level summary estimate. Middle: estimated noise level (standard deviation) for
each probe. Bottom: probe affinity effects with respect to the probeset-level signal estimate.

We introduce targeted probabilistic tools for investigating probe performance directly based
on the expression measurements and independently of external information such as genomic
alignments of the probes. The Robust Probabilistic Averaging (RPA) package provides
tools to quantify and investigate probe affinity and stochastic noise levels based on the
framework introduced in [7], where detailed analysis of probe-level parameters was used
to quantify relative contributions from known probe-level error sources such as SNPs,
GC-content, genomic mismatches and probe interrogation position along the target sequence.
In many cases, however, the source of probe contamination remained unknown, highlighting
the need for data-driven methods to assess probe performance. The implementation
provides tools to assess probe performance and to guide microarray preprocessing
and probe design in future studies.
2 Robust Probabilistic Averaging



RPA assumes a Gaussian model for probe effects with a probe-specific mean and variance
parameter [7]. These parameters are directly interpretable as a constant affinity term
and a stochastic noise level for each probe (Figure 1). The affinity parameter indicates
a constant shift of the probe-level signal from the probeset-level estimate; low and high
binding affinities lead to reduced and increased signal intensities with respect to the
probeset-level signal, respectively. The stochastic noise parameter quantifies the overall
accuracy of a probe with respect to the probeset-level signal shape, which is common to
all probes in a probeset. Probes with smaller variance follow the probeset-level signal
shape more accurately than noisy probes with high variance. Similar affinity parameters are
utilized, for instance, by the widely used RMA preprocessing algorithm which, in contrast
to our model, assumes an equal stochastic noise level for all probes and focuses on probe
summarization, while our model is particularly designed for probe performance analysis [7].
Another key difference between RPA and standard probe-level models such as RMA is the
use of probe-level differential expression estimates, which have been shown to improve
cross-platform comparability [3, 7]. The model gives tools to quantify probe-level effects and to
assess the relative contributions of the many factors that can affect probe performance [7].

Each probe in a probeset is assumed to capture the underlying target signal with
probe-specific binding affinity and noise level. Probe performance can be assessed by
investigating the observations across multiple arrays. More formally, the probe-level signal
for probe $j$ in sample $i$ is modeled in terms of a constant intercept term $\mu$, shape
parameter $d_i$, probe affinity $\mu_j$ and stochastic probe-level noise
$\varepsilon_{ij} \sim N(0, \tau_j^2)$ as $s_{ij} = \mu + d_i + \mu_j + \varepsilon_{ij}$. The
first step of the analysis consists of estimating the signal shape $d = [d_1, \ldots, d_N]$ and
probe-specific variances $\tau^2 = [\tau_1^2, \ldots, \tau_J^2]$. Considering the differential
expression profile of a given probe $j$ with respect to an arbitrarily selected reference
array $r$, the parameters $\mu$ and $\mu_j$ cancel out. This allows efficient estimation of $d$
and $\tau^2$; uncertainty in the reference effect $\varepsilon_{rj}$ is marginalized out [7].
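
The cancellation of the intercept and affinity terms under differencing can be checked numerically. The following is a minimal sketch, not the RPA implementation: the function names and all parameter values are hypothetical and chosen only for illustration, and the noise is set to zero so that the differential profile can be compared exactly with the signal shape.

```python
import random


def simulate_probeset(mu, d, affinities, taus, rng):
    """Simulate s[i][j] = mu + d[i] + affinities[j] + noise,
    with noise drawn from N(0, taus[j]^2) for probe j."""
    return [
        [mu + d_i + a_j + rng.gauss(0.0, tau_j)
         for a_j, tau_j in zip(affinities, taus)]
        for d_i in d
    ]


def differential(signal, ref):
    """Subtract the reference array (row `ref`) from every array, probe by probe."""
    ref_row = signal[ref]
    return [[s - r for s, r in zip(row, ref_row)] for row in signal]


# Hypothetical toy values: 4 arrays, 3 probes.
rng = random.Random(0)
mu = 8.0                       # constant intercept
d = [0.0, 0.5, -1.0, 2.0]      # signal shape across arrays
affinities = [-0.7, 0.2, 0.5]  # probe-specific affinity terms

# With zero noise, every probe's differential profile equals d[i] - d[ref]:
# neither the intercept nor the affinities appear in the result.
noiseless = simulate_probeset(mu, d, affinities, [0.0, 0.0, 0.0], rng)
diff = differential(noiseless, ref=0)
```

With non-zero noise levels, each differential profile would equal the same shape difference plus the difference of two noise terms, which is why the reference noise has to be accounted for in the estimation.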

Since its original publication the model has been extended to estimate the remaining terms
in the probe-level model. We require that the expected affinity effect of the probes tends
to be close to zero by assuming a Gaussian prior $\mu_j \sim N(0, \sigma_j^2)$ for the
affinities. Then the expected sum of the probe-specific affinities $\mu_j$ is zero. This gives
a probabilistic interpretation for RMA, which would be obtained by setting an identical prior
for all probes. However, instead of giving equal weight to all probes in affinity estimation,
we consider an alternative approach, where the probes are weighted according to their
noisiness by setting $\sigma_j^2 = \tau_j^2$. This yields a more flexible but robust model that
takes into account the varying degrees of reliability of individual probes: the expectation of
the probe affinities remains at zero, but noisier probes have less effect on the estimated
signal level $\mu$. In the limit of large sample size, the solution converges to the mean of the
probe-level observations weighted inversely by the probe-specific variances that quantify the
noise level of each probe; the probabilistic formulation is robust to uncertainties in the data
when the sample size is limited. The probe-level estimates quantify probe performance; the
probeset-level signal estimate is useful for preprocessing purposes. Moreover, hyperpriors of
the model parameters allow incorporation of prior information on the probes in the analysis.
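
The large-sample behaviour described above can be illustrated with a simple variance-weighted average. This is a sketch under an assumption: we read "weighted by probe-specific variances" as inverse-variance weighting, which is consistent with noisier probes having less effect on the signal estimate. The function name and the numbers are hypothetical, and the sketch does not reproduce the full probabilistic estimation in RPA.

```python
def precision_weighted_mean(values, variances):
    """Average per-probe values with weights 1 / variance,
    so that probes with a high noise level contribute less."""
    weights = [1.0 / v for v in variances]
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


# Hypothetical per-probe signal levels; the third probe is noisy and off-target.
probe_levels = [8.0, 8.2, 12.0]
probe_variances = [0.1, 0.1, 10.0]

weighted = precision_weighted_mean(probe_levels, probe_variances)
unweighted = sum(probe_levels) / len(probe_levels)
```

In this toy case the weighted estimate stays close to the two low-noise probes, whereas the unweighted mean is pulled towards the noisy one. With equal variances the two estimates coincide, which mirrors the equal-noise assumption discussed for RMA.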

RPA assumes background-corrected, normalized, and log-transformed probe-level data. 

The BioConductor package provides standard tools for preprocessing, including support 
for alternative CDF environments and allows compatible downstream analysis with other 
BioConductor tools. By default, RPA uses the standard RMA background correction [5] 
and quantile normalization [1]. See the package documentation for further options. 



3 Experimental validation 



Comparisons to known probe-level error sources, such as SNPs, GC-content and genomic
mismatches, have been used to validate the estimates of probe-specific noise, and RPA has
previously been shown to enhance cross-platform comparability in differential gene
expression studies [7]. To further validate the model, we calculated the average ranking of
various preprocessing methods across the 14 tests in AffyCompII [2] on two spike-in data
sets (Supplementary Tables 1-2). Notably, while RPA is primarily targeted at probe
performance analysis, it also outperformed many widely used preprocessing algorithms, such
as RMA [6], which supports the validity of the probe-level model. Certain other algorithms,
including variants of GC-RMA (Wu and Irizarry, 2001) and FARMS (Hochreiter, 2006),
also outperformed RMA. The differences in preprocessing performance reflect the
complexity of the probe-level models: FARMS and GC-RMA have more detailed models for probe
effects, and RMA is obtained as a special case of the RPA algorithm when the stochastic
noise is assumed equal between all probes.
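
An average ranking of this kind can be computed as sketched below. The sketch rests on assumptions that are not stated in the text: each test score is oriented so that higher is better, ties are broken arbitrarily, and the method names and scores are hypothetical placeholders rather than AffyCompII results.

```python
def average_ranks(scores):
    """Return the mean rank of each method across tests (rank 1 = best).

    scores maps a method name to a list with one score per test,
    where a higher score is assumed to be better.
    """
    methods = list(scores)
    n_tests = len(next(iter(scores.values())))
    totals = {m: 0.0 for m in methods}
    for t in range(n_tests):
        # Order the methods from best to worst on test t.
        ordered = sorted(methods, key=lambda m: scores[m][t], reverse=True)
        for rank, m in enumerate(ordered, start=1):
            totals[m] += rank
    return {m: totals[m] / n_tests for m in methods}


# Hypothetical scores for three placeholder methods on three tests.
toy_scores = {
    "method_a": [0.9, 0.8, 0.7],
    "method_b": [0.5, 0.9, 0.6],
    "method_c": [0.1, 0.2, 0.9],
}
ranks = average_ranks(toy_scores)
```

A lower average rank indicates better overall performance across the tests. Tests where a lower score is better would need to be sign-flipped before ranking.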



4 Conclusion 



Robust Probabilistic Averaging provides tools for probe performance analysis and
preprocessing. In contrast to standard preprocessing packages, RPA provides explicit data-driven
estimates of the affinity and noise level for individual probes and tools to interpret this
information. The information can be used to assess the relative contributions from
different probe-level noise sources, to guide preprocessing and to verify the end results of a
microarray study. Better understanding of the probe-level effects can ultimately lead to
improved probe and microarray design.